The bioinformatics analysis pipeline nfcore/ampliseq is used for amplicon sequencing, supporting denoising of any amplicon and supports a variety of taxonomic databases for taxonomic assignment including 16S, ITS, CO1 and 18S.
Pipeline input was saved in folder input.
Sequencing data was provided in the samplesheet file
samplesheet.csv that is displayed below:
Metadata associated with the sequencing data was provided in
metadata.tsv and is displayed below:
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. The sequence quality was checked using FastQC and resulting data was aggregated using the FastQC module of MultiQC. For more quality controls and per sample quality checks you can check the full MultiQC report, which can be found in multiqc/multiqc_report.html.
Cutadapt is trimming primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerated PCR primer. If primer trimming were to be omitted, artifactual amplicon sequence variants might be computed by the denoising tool or sequences might be lost due to being labelled as PCR chimera. Primers were trimmed using cutadapt and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 2.2% of the sequences were discarded per sample and a mean of 98.2% of the sequences per sample passed the filtering. Cutadapt results can be found in folder cutadapt.
Additional quality filtering can improve sequence recovery. Often it is advised trimming the last few nucleotides to avoid less well-controlled errors that can arise there. Forward reads were trimmed at 215 bp and reverse reads were trimmed at 200 bp, reads shorter than this were discarded. Reads with more than 2 expected errors were discarded. Read counts passing the filter are shown in section ‘Read counts per sample’ column ‘filtered’.
Quality profiles:
Forward (left) and reverse (right) read quality stats for incoming data:
Forward (left) and reverse (right) read quality stats for preprocessed data:
Overall read quality profiles are displayed as heat map of the
frequency of each quality score at each base position. The mean quality
score at each position is shown by the green line, and the quartiles of
the quality score distribution by the orange lines. The red line shows
the scaled proportion of reads that extend to at least that position.
Original plots can be found in folder dada2/QC/ with names that end in
_qual_stats.pdf.
DADA2 performs fast and accurate sample inference from amplicon data with single-nucleotide resolution. It infers exact amplicon sequence variants (ASVs) from amplicon data with fewer false positives than many other methods while maintaining high sensitivity.
DADA2 reduces sequence errors and dereplicates sequences by quality filtering, denoising, read pair merging (for paired end Illumina reads only) and PCR chimera removal.
Read error correction was performed using estimated error rates, visualized below. Error rates for forward reads are at the left side and reverse reads are at the right side.
Estimated error rates are displayed for each possible transition. The
black line shows the estimated error rates after convergence of the
machine-learning algorithm. The red line shows the error rates expected
under the nominal definition of the Q-score. The estimated error rates
(black line) should be a good fit to the observed rates (points), and
the error rates should drop with increased quality. Original plots can
be found in folder dada2/QC/ with names that
end in .err.pdf.
Tracking read numbers through DADA2 processing steps for each sample. The following table shows the read numbers after each processing stage. Processing stages are: input - read pairs into DADA2, filtered - read pairs passed quality filtering, denoisedF - forward reads after denoising, denoisedR - reverse reads after denoising, merged - successfully merged read pairs, nonchim - read pairs in non-chimeric sequences (final ASVs).
Samples with unusual low reads numbers relative to the number of expected ASVs should be treated cautiously, because the abundance estimate will be very granular and might vary strongly between (theoretical) replicates due to high impact of stochasticity.
Following, the numbers of the table above are shown in stacked barcharts as percentage of DADA2 input reads. Stacked barchart of read pair numbers (denoisedF & denoisedR halfed, because each pair is split) per sample and processing stage:
Between 33.05% and 80.49% reads per sample (average 56.8%) were retained for analysis within DADA2 steps.
The proportion of lost reads per processing stage and sample should not be too high, totalling typically <50%. Samples that are very different in lost reads (per stage) to the majority of samples must be compared with caution, because an unusual problem (e.g. during nucleotide extraction, library preparation, or sequencing) could have occurred that might add bias to the analysis.
Finally, 1175 amplicon sequence variants (ASVs) were obtained across
all samples. The ASVs can be found in dada2/ASV_seqs.fasta. And the
corresponding quantification of the ASVs across samples is in dada2/ASV_table.tsv. An extensive
table containing both was saved as dada2/DADA2_table.tsv. ASVs were
initally inferred for each sample independently, but re-examined with
all samples (pseudo-pooled).
Barrnap classifies the ASVs into the origin domain (including mitochondrial origin).
Barrnap classified 0 ( 0 %) ASVs as most similar to Bacteria, 0 ( 0
%) ASVs to Archea, 0 ( 0 %) ASVs to Mitochondria, 0 ( 0 %) ASVs to
Eukaryotes, and 1175 ( 100 %) were below similarity threshold to any
kingdom.
rRNA classification results can be found in folder barrnap.
Data was processed using nf-core/ampliseq version 2.8.0 (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
Data quality was evaluated with FastQC (Andrews, 2010) and summarized with MultiQC (Ewels et al., 2016). Cutadapt (Marcel et al., 2011) trimmed primers and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 2.2% of the sequences were discarded per sample and a mean of 98.2% of the sequences per sample passed the filtering. Adapter and primer-free sequences were processed initally independently, but re-examined as one pool (pseudo-pooled) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, trim reads (forward reads at 215 bp and reverse reads at 200 bp, reads shorter than this were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 1175 amplicon sequencing variants (ASVs) were obtained across all samples. Between 33.05% and 80.49% reads per sample (average 56.8%) were retained. The ASV count table contained in total 282432 counts, at least 23884 and at most 42243 per sample (average 35304).
WARNING This methods section is lacking software versions, these can be found in MultiQC’s report section Software Versions or in folder pipeline_info file
software_versions.yml.
MultiQC summarized computational methods in multiqc/multiqc_report.html. The proposed short methods description can be found in MultiQC’s Methods Description, versions of software collected at runtime in MultiQC’s Software Versions, and a summary of non-default parameter in MultiQC’s Workflow Summary.
Technical information to the pipeline run are collected in folder pipeline_info, including software versions
collected at runtime in file software_versions.yml (can be
viewed with a text editor), all parameter settings in file
params_{date}_{time}.json (can be viewed with a text
editor), execution report in file
execution_report_{date}_{time}.html, execution trace in
file execution_trace_{date}_{time}.txt, execution timeline
in file execution_timelime_{date}_{time}.html, and pipeline
direct acyclic graph (DAG) in file
pipeline_dag_{date}_{time}.html.
This report (file summary_report.html) is located in
folder summary_report of the original pipeline results
folder. In this file, all links to files and folders are relative,
therefore hyperlinks will only work when the report is at its original
place in the pipeline results folder. Plots specifically produced for
this report (if any) can be also found in folder summary_report.
A comprehensive read count report throughout the pipeline can be
found in the base results folder in file
overall_summary.tsv.
Please cite the pipeline publication and any software tools used by the pipeline (see citations) when you use any of the pipeline results in your study.